
Add low precision attention API from torchao to TorchAoConfig #13285

Open
howardzhang-cv wants to merge 1 commit into huggingface:main from howardzhang-cv:feature/fp8_attn_ao

Conversation

Contributor

@howardzhang-cv howardzhang-cv commented Mar 19, 2026

What does this PR do?

Adds the low precision attention API from TorchAO to Diffusers by extending TorchAoConfig with an attn_backend option.
Note: this requires torchao 0.17.0.
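A minimal sketch of what the extended config could look like. The field name `attn_backend` comes from this PR's description; the accepted value `"fp8_attn"` is taken from the benchmark labels below, and the validation logic is illustrative, not the actual diff:

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical sketch of TorchAoConfig with the new attn_backend option.
# The set of supported backends is an assumption for illustration.
SUPPORTED_ATTN_BACKENDS = ("fp8_attn",)

@dataclass
class TorchAoConfig:
    quant_type: str = "int8_weight_only"
    # None keeps the default (unquantized) attention path.
    attn_backend: Optional[str] = None

    def __post_init__(self):
        # Reject backends torchao does not provide, mirroring how
        # quantization configs typically validate their options.
        if self.attn_backend is not None and self.attn_backend not in SUPPORTED_ATTN_BACKENDS:
            raise ValueError(f"Unsupported attn_backend: {self.attn_backend!r}")

config = TorchAoConfig(quant_type="float8wo", attn_backend="fp8_attn")
print(config.attn_backend)  # fp8_attn
```

In real usage this config would be passed to a pipeline's `from_pretrained` as the quantization config, as with the existing TorchAoConfig.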

Todo:

Results:

flux.1-dev with 2048x2048 image size

| Config | Median Time (s) | Speedup |
| --- | --- | --- |
| bf16 baseline + torch.compile | 3.17 | 1.00x |
| fp8_attn + compile | 2.66 | 1.07x |

Wan2.1-14B-Diffusers with 1280x720 frame size and 81 frames

| Config | Median Time (s) | Speedup |
| --- | --- | --- |
| bf16 baseline + torch.compile | 81.32 | 1.00x |
| fp8_attn + compile | 46.88 | 1.18x |

Note that these results use a naive scheme in which every layer is quantized. Doing so degrades quality when running many inference steps (50 for Wan2.1). A better scheme, such as skipping the early and late layers and quantizing 36 of the 40 total layers, yields a 1.12x speedup with much better quality. The VBench benchmarks for that scheme are below:
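The layer-skipping scheme above can be sketched as a simple selection helper. The exact split of skipped layers is not stated in the PR; skipping 2 early and 2 late blocks is an assumption that matches the quoted 36/40 count:

```python
# Sketch of the "skip early and late layers" scheme: quantize the middle
# blocks of a 40-layer transformer, leaving the first/last few in bf16.
# skip_first=2 / skip_last=2 are assumed values summing to the 4 skipped
# layers implied by "quantizing 36/40 total layers".
def layers_to_quantize(num_layers: int, skip_first: int = 2, skip_last: int = 2) -> list[int]:
    """Return indices of transformer blocks to quantize."""
    return list(range(skip_first, num_layers - skip_last))

selected = layers_to_quantize(40)
print(len(selected))  # 36
```

In practice such a list would be used as a module filter when applying the quantization config, so that the skipped blocks keep the bf16 attention path.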

(image: VBench benchmark results)

@github-actions github-actions Bot added quantization size/M PR with diff < 200 LOC labels Apr 20, 2026
@howardzhang-cv howardzhang-cv marked this pull request as ready for review April 20, 2026 23:54
